Our R configuration is as follows:
sessionInfo(package=NULL)
## R version 3.3.2 (2016-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Sierra 10.12.4
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] backports_1.0.5 magrittr_1.5 rprojroot_1.2 tools_3.3.2
## [5] htmltools_0.3.5 yaml_2.1.14 Rcpp_0.12.10 stringi_1.1.3
## [9] rmarkdown_1.4 knitr_1.15.1 stringr_1.2.0 digest_0.6.12
## [13] evaluate_0.10
Our original data is sourced from Data.World and covers visits to U.S. National Parks from 1904 to 2016. We joined this data with population and gender breakdowns by state from U.S. Census data. Finally, we joined latitude and longitude coordinates for each of the 50 U.S. states, as well as Washington, D.C.
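The joins described above can be sketched in base R with merge(); every data frame and column name below is an illustrative placeholder, not the actual source schema (the pull scripts themselves use dplyr joins).

```r
# Minimal sketch of the three-way join: park visits + census + lat/long,
# all keyed on State. Names and figures here are placeholders.
parks <- data.frame(Park = c("Yosemite NP", "Acadia NP"),
                    State = c("California", "Maine"),
                    Visitors = c(5028868, 3303393),
                    stringsAsFactors = FALSE)

census <- data.frame(State = c("California", "Maine"),
                     Population = c(39250017, 1331479),
                     stringsAsFactors = FALSE)

latlong <- data.frame(State = c("California", "Maine"),
                      Latitude = c(36.78, 45.25),
                      Longitude = c(-119.42, -69.45),
                      stringsAsFactors = FALSE)

# Inner-join the three tables on the shared State key
joined <- merge(merge(parks, census, by = "State"), latlong, by = "State")
```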
Our initial analysis was performed in Tableau.
We began by creating a boxplot showing how national park visits differ by operating region. The image below shows how the National Park Service divides its parks into regions:
The boxplot shows:
Moving forward, we decided to analyze which parks had the highest number of visits over this period. We created the following visualization, then selected all parks with more than 100 million total visits into a new set. This set was then used to create our second interesting visualization.
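A hypothetical R equivalent of that Tableau set: sum visits per park across all years, then keep parks over 100 million. The data frame and its column names (Park, Visitors) are assumed for illustration.

```r
# Toy data standing in for the per-year visits table
visits <- data.frame(Park = c("A", "A", "B"),
                     Visitors = c(6e7, 5e7, 4e7),
                     stringsAsFactors = FALSE)

# Total visits per park across all years
totals <- aggregate(Visitors ~ Park, data = visits, FUN = sum)

# Keep only parks whose combined visits exceed 100 million
topParks <- subset(totals, Visitors > 1e8)
```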
The five most-visited parks are as follows:
With our Tableau analysis complete, we pulled the data into RStudio.
First, the initial National Park visits data is pulled:
source("../01 Data/prETLNatVisPull.R")
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## Loading required package: data.world
##
## Attaching package: 'data.world'
## The following object is masked from 'package:dplyr':
##
## query
## Joining, by = "State"
## [1] "Success!"
Next, the Census Data is pulled:
source("../01 Data/censusPull.R")
## [1] "Success!"
Finally, the Latitude and Longitude information is pulled:
source("../01 Data/stateLatLongPull.R")
## Warning: Duplicated column names deduplicated: 'Location' =>
## 'Location_1' [6]
## [1] "Success!"
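The warning above comes from the CSV reader renaming a second `Location` column to `Location_1`; it is harmless as long as the intended column is referenced by its new name. Base R's `data.frame` does the same kind of deduplication (with a `.` suffix instead of `_`):

```r
# Base R analogue of the column-name deduplication warned about above:
# a repeated name gets a suffix so all names are unique.
df <- data.frame(Location = 1, Location = 2, check.names = TRUE)
names(df)  # "Location" "Location.1"
```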
Our final task is to perform ETL operations, including the removal of blank data and character reformatting to better suit further visualization in R.
# ETL process and export to .csv
measures   <- c("GNIS", "Visitors", "YearRaw")
dimensions <- setdiff(names(prETLNatVisDF), measures)

# Strip non-printable/special characters from every column
for (n in names(prETLNatVisDF)) {
  prETLNatVisDF[n] <- data.frame(lapply(prETLNatVisDF[n], gsub,
                                        pattern = "[^ -~]", replacement = ""))
}

na2emptyString <- function(x) {
  x[is.na(x)] <- ""
  return(x)
}

if (length(dimensions) > 0) {
  for (d in dimensions) {
    # Change NA to the empty string.
    prETLNatVisDF[d] <- data.frame(lapply(prETLNatVisDF[d], na2emptyString))
    # Replace & with " and " in dimensions.
    prETLNatVisDF[d] <- data.frame(lapply(prETLNatVisDF[d], gsub,
                                          pattern = "&", replacement = " and "))
    # Change : to ; in dimensions.
    prETLNatVisDF[d] <- data.frame(lapply(prETLNatVisDF[d], gsub,
                                          pattern = ":", replacement = ";"))
  }
}
## Warning in `[<-.factor`(`*tmp*`, is.na(x), value = ""): invalid factor
## level, NA generated
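These warnings arise because `data.frame()` (with the pre-R-4.0 default `stringsAsFactors = TRUE`, which applies under R 3.3.2) converts the cleaned columns to factors, so assigning `""` to NA entries fails when `""` is not an existing factor level. A minimal reproduction, and the character-vector workaround:

```r
f <- factor(c("a", NA))
# f[is.na(f)] <- ""   # would warn: invalid factor level, NA generated

# Working on a character vector avoids the warning entirely:
x <- as.character(f)
x[is.na(x)] <- ""
stopifnot(identical(x, c("a", "")))
```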
# Replace NA with 0 (for numeric measure columns)
na2zero <- function(x) {
  x[is.na(x)] <- 0
  return(x)
}

# Copy the cleaned data frame and export it to .csv
ETLNatVisDF <- prETLNatVisDF
write.csv(ETLNatVisDF, file = "NatVisDF.csv")
print("Success!")
## [1] "Success!"
Of note here: first, National Park visits actually grew during the Great Depression and the Great Recession, suggesting that visits to these parks are not treated as discretionary spending. It may also be that a poor economy sends consumers looking for cheaper vacations, a role National Parks fill well. Lastly, the stress of difficult economic times may spur people to get outdoors and enjoy nature.
However, the most interesting aspect is the massive growth in National Park visits during “Mission 66,” a large-scale infrastructure program to make parks more accessible. In addition to roads and trails, it funded the creation of camping and housing sites, as well as an advertising campaign promoting the natural beauty of these parks to citizens across the country.
This visualization highlights states with parks whose combined visits from 1904 to 2016 exceed 100 million, and then lists the single most visited park within each such state (in some cases, such as California, multiple parks exceed 100 million visits).
This visualization shows visits by state, filtered by year, and colored by the ratio of visits to state population. The goal here is to identify states or areas that bring in vastly more visitors than their populations. Washington, D.C., a city with a population of approximately 670,000 people, draws almost 55 times its population in visitors to its National Parks and Monuments. Another state where this occurs is Wyoming, which draws nearly 17 times its population in visits.
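The coloring described above is a simple visits-to-population ratio. The sketch below uses the approximate D.C. population from the text; the visitor totals are illustrative placeholders chosen only to reproduce the stated multiples.

```r
# Visits-to-population ratio used for coloring the map.
# Visitor totals are placeholders, not the actual data.
states <- data.frame(State      = c("District of Columbia", "Wyoming"),
                     Population = c(670000, 585000),
                     Visitors   = c(36850000, 9945000))
states$VisitRatio <- states$Visitors / states$Population
```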
We uploaded the .csv files produced by the ETL and pull operations above to Data.World. From there, our Shiny app uses SQL to query the relevant data and create a variety of visualizations in R. The Shiny deployment displays the steps taken to reach the interesting visualizations.
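A hypothetical sketch of the Shiny app's query step using the data.world package's `query()`/`qry_sql()` interface. The dataset slug, table name, and column names are placeholders, and a data.world API token must be configured before the function will actually run.

```r
# Placeholder query wrapper: fetch park visits from a data.world dataset.
# "username/natvisdf" and the table/column names are assumptions.
getParkVisits <- function(dataset = "username/natvisdf") {
  data.world::query(
    data.world::qry_sql("SELECT Park, State, Visitors FROM NatVisDF"),
    dataset = dataset
  )
}
```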